Exploratory Data Analysis 2 and Feature Engineering
Author
Edimer David Jaramillo
Published
February 29, 2024
Document description
In this document, exploratory data analysis is oriented to variables that change over time. In the document 05-EDA1.qmd several of the results make use of the response variable bloom_doy summarized with the median, in this document the objective is to explore the temporal variation of the response variable and the association with climatic variables (df_weather) and photoperiod (df_photoperiod).
Libraries and setup
Code
# Librarieslibrary(tidyverse)library(arrow)library(glue)library(reactable)library(latex2exp)library(splines)library(forecast)library(plotly)# Colorscolors_custom <-c("#014e25","#800080","#ffa500","#008080","#ff6347","#0000cd")# Theme ggplot2theme_set(theme_bw() +theme(legend.position ="top"))# Functions## These functions are not required in this documentfunctions_exclude <-c("../source/r/module-EDA","../source/r/extractPhotoperiod.R","../source/r/getWeatherPOWER.R" )dir_functions <- fs::dir_ls("../source/r/")c(dir_functions[!dir_functions %in% functions_exclude], fs::dir_ls("../source/r/module-EDA/")) |>walk(.f = source)# Inputscountries <-c("Japan", "Switzerland", "South Korea", "USA-WDC")
Databases
In total there are 490 different coordinates.
df_full: complete data with locations (location and country), coordinates and response variable. I select only the variables of interest for this analysis.
Total rows: 17966
Total columns: 8
df_weather: weather data for each coordinate
Total rows: 7738651
Total columns: 17
Date range: 1981-01-01 until 2024-02-25
df_photoperiod: photoperiod data for each coordinate.
Total rows: 9456891
Total columns: 3
Date range: 0811-04-02 until 2023-12-30
From the information provided by NPN (df_npn) I take the Site_ID == 32789 and the Species_ID == 228 to identify the records of the site of interest in New York.
Data structuring and initial Feature Engineering (FE)
In the database data_bloom_year I keep the coordinates of interest and the year of registration.
I obtain a table with summary metrics (data_summary_weather) per year (since 1981) for each coordinate and for each of the climate variables, the calculated metrics were the average, median, messtandard deviation, minimum and maximum. With this table it is possible to relate climatic information from the same flowering year or information from delayed years. Finally I create a database (data_weather_yearly) with the response variable bloom_doy, the coordinates, the year and the climate information. This table could also be used for model building.
I obtain a table with summary metrics (data_summary_weather_monthly) by year and month (since 1981) for each coordinate and for each of the climate variables, the calculated metrics were the average, median, messtandard deviation, minimum and maximum. With this table it is possible to relate climatic information from the same year and month of flowering or information from delayed years and months. Finally I create a database (data_weather_yearly) with the response variable bloom_doy, the coordinates, the year, month and the climate information. This table could also be used for model building.
Having climatic variables summarized on a monthly basis has advantages and disadvantages. Personally I think that the greatest advantage comes from the level of detail in the climate profile of the locations, however, since there are 9 climate variables, three summary metrics (average, median and standard deviation) and 12 months implies that 324 new variables are generated, to these variables are added the 12 that I obtain with the variable FROST_DAYS for a total of 336 possible predictors. As many of these variables are correlated, using multivariate techniques for dimensionality reduction allows improving characterization in a smaller dimensional space.
Although I have photoperiod information from the year 81 (one year ago from the first date at each latitude), taking into account that the climate information is from 1981, I also apply this filter to the photoperiod records. I obtain three summary metrics (average, median and standard deviation) for the photoperiod by year (data_photop_coords_yearly) and by year-month (data_photop_coords_monthly), with which 36 new variables are generated.
I obtain a table that shows the range of dates for which there is a record for each location (coordinate), the number (n) of years for which there is a record of flowering is also added.
The record with the minimum flowering date is April 1, 812.
Some locations do not have updated information. The coordinate Lat: 46.52733 - Long: 8.932936 has 10 records from 1978 to 1987, being the coordinate with the “most outdated” record.
To calculate the difference (in days) between flowering \(t_i\) with \(t_{i-1}\) I take into account coordinates that meet the following characteristics:
Minimum 2 records \((n > 1)\)
Difference between years equal to 1, that is, I only keep coordinates whose blooms have been reported with an annual frequency, it does not matter if it was 20 years ago, the important thing is that there is information year after year.
If we call \(\gamma\) the bloom date (bloom_date) that has measurements in a year \(t\), with \(t = 1, 2, ..., k\), where \(k\) is the number of records that meet the two restrictions described previously, the distribution shown in the following graphs is the difference of \(\gamma_t\) with \(\gamma_{t-1}\), that is, the lag operator with \(p = 1\) on the difference in days between flowerings. It is important to mention that the user will be able to choose the lag value to generate \((\gamma _{t-p})\), in which case for the first condition \(n\) is expressed as \(n = p + 1\); The second condition remains the same regardless of the value of \(p\).
In the following graphs the response variable in year \(t\) is related to \(t-i\). Unlike the previous graphs in these I use the variable bloom_doy and evaluate a decade, that is, \(i = 1, 2, ..., 10\).
The meteorological data were obtained with R from the Prediction of Worldwide Energy Resources (POWER) project, using the nasapower for extraction of time series (from 1981-01-01 to 2024-02-25) with daily frequency of the following variables:
Average Earth Surface Temperature (TS)
Maximum temperature of the earth’s surface (TS_MAX)
Minimum land surface temperature (TS_MIN)
Evaporation from the earth’s surface (EVLAND)
Frost days (FROST_DAYS)
Precipitation (PRECTOTCORR)
Soil moisture profile (GWETPROF)
Relative humidity at two meters (RH2M)
Soil moisture in the root zone (GWETROOT)
The total amount of ozone in a column extending vertically from the Earth’s surface to the top of the atmosphere (TO3)
As the POWER project is aimed at three user communities, in this case the Agroclimatology community (AG) is used.
For these graphs I filter out records that are not from New York since the DOY is very high compared to the other four sites.
It is important to mention that these graphs are built with the data summarized by year (data_weather_yearly), however, later the relationship of the climate summarized on a monthly basis (data_weather_monthly) is evaluated.
---title: "International Cherry Blossom Prediction Competition"subtitle: "Exploratory Data Analysis 2 and Feature Engineering"author: "Edimer David Jaramillo"date: "`r lubridate::now()`"lang: en-USformat: html: page-layout: article toc: true code-fold: true df-print: paged toc-location: left number-depth: 4 theme: yeti code-copy: true highlight-style: github embed-resources: true code-tools: source: true ---```{r}#| label: setup#| include: falseknitr::opts_chunk$set(echo =TRUE,warning =FALSE,error =FALSE, message =FALSE,fig.align ='center')```# Document description- In this document, exploratory data analysis is oriented to variables that change over time. In the document **05-EDA1.qmd** several of the results make use of the response variable `bloom_doy` summarized with the median, in this document the objective is to explore the temporal variation of the response variable and the association with climatic variables (`df_weather`) and photoperiod (`df_photoperiod`).# Libraries and setup```{r}# Librarieslibrary(tidyverse)library(arrow)library(glue)library(reactable)library(latex2exp)library(splines)library(forecast)library(plotly)# Colorscolors_custom <-c("#014e25","#800080","#ffa500","#008080","#ff6347","#0000cd")# Theme ggplot2theme_set(theme_bw() +theme(legend.position ="top"))# Functions## These functions are not required in this documentfunctions_exclude <-c("../source/r/module-EDA","../source/r/extractPhotoperiod.R","../source/r/getWeatherPOWER.R" )dir_functions <- fs::dir_ls("../source/r/")c(dir_functions[!dir_functions %in% functions_exclude], fs::dir_ls("../source/r/module-EDA/")) |>walk(.f = source)# Inputscountries <-c("Japan", "Switzerland", "South Korea", "USA-WDC")```# Databases- In total there are 490 different coordinates.- **`df_full`:** complete data with locations (location and country), coordinates and response variable. I select only the variables of interest for this analysis. - **Total rows:** 17966 - **Total columns:** 8- **`df_weather`:** weather data for each coordinate - **Total rows:** 7738651 - **Total columns:** 17 - **Date range:** 1981-01-01 until 2024-02-25- **`df_photoperiod`:** photoperiod data for each coordinate. - **Total rows:** 9456891 - **Total columns:** 3 - **Date range:** 0811-04-02 until 2023-12-30- From the information provided by *NPN* (`df_npn`) I take the `Site_ID == 32789` and the `Species_ID == 228` to identify the records of the site of interest in New York.# Data structuring and initial *Feature Engineering* (FE)- In the database `data_bloom_year` I keep the coordinates of interest and the year of registration.- I obtain a table with summary metrics (`data_summary_weather`) per year (since 1981) for each coordinate and for each of the climate variables, the calculated metrics were the *average*, *median*, *messtandard deviation*, *minimum * and *maximum*. With this table it is possible to relate climatic information from the same flowering year or information from delayed years. Finally I create a database (`data_weather_yearly`) with the response variable `bloom_doy`, the coordinates, the year and the climate information. This table could also be used for model building.- I obtain a table with summary metrics (`data_summary_weather_monthly`) by year and month (since 1981) for each coordinate and for each of the climate variables, the calculated metrics were the *average*, *median*, *messtandard deviation*, *minimum and maximum*. With this table it is possible to relate climatic information from the same year and month of flowering or information from delayed years and months. Finally I create a database (`data_weather_yearly`) with the response variable `bloom_doy`, the coordinates, the year, month and the climate information. This table could also be used for model building.- Having climatic variables summarized on a monthly basis has advantages and disadvantages. Personally I think that the greatest advantage comes from the level of detail in the climate profile of the locations, however, since there are 9 climate variables, three summary metrics (*average*, *median* and *standard deviation*) and 12 months implies that 324 new variables are generated, to these variables are added the 12 that I obtain with the variable `FROST_DAYS` for a total of 336 possible predictors. As many of these variables are correlated, using multivariate techniques for dimensionality reduction allows improving characterization in a smaller dimensional space.- Although I have photoperiod information from the year 81 (one year ago from the first date at each latitude), taking into account that the climate information is from 1981, I also apply this filter to the photoperiod records. I obtain three summary metrics (*average*, *median* and *standard deviation*) for the photoperiod by year (`data_photop_coords_yearly`) and by year-month (`data_photop_coords_monthly`), with which 36 new variables are generated.```{r}# ------ DATA ------------------------------# Data with coordinates and response variabledf_full_complete <-read_parquet("../external-data/df_full.parquet") |>mutate(location =if_else( location =="Switzerland/Z\xfcrich-Albisg\xfcetli",true ="Switzerland/Zürich-Albisgüetli",false = location ),location =if_else( location =="Switzerland/M\xf6hlin",true ="Switzerland/Möhlin",false = location ),location =if_else( location =="Switzerland/W\xe4denswil",true ="Switzerland/Wädenswil",false = location ),location =if_else( location =="Switzerland/Z\xfcrich-MeteoSchweiz",true ="Switzerland/Zürich-MeteoSchweiz",false = location ),location =if_else( location =="Switzerland/Z\xfcrich-Witikon",true ="Switzerland/Zürich-Witikon",false = location ),location =if_else( location =="Switzerland/Gr\xfcsch",true ="Switzerland/Grüsch",false = location ),location =if_else( location =="Switzerland/N\xe4fels",true ="Switzerland/Näfels",false = location ),location =if_else( location =="Switzerland/Sch\xf6nenwerd",true ="Switzerland/Schönenwerd",false = location ),location =if_else( location =="Switzerland/D\xf6ttingen",true ="Switzerland/Döttingen",false = location ),location =if_else( location =="Switzerland/H\xf6fen",true ="Switzerland/Höfen",false = location ),location =if_else( location =="Switzerland/Le S\xe9pey",true ="Switzerland/Le Sépey",false = location ),location =if_else( location =="Switzerland/La Br\xe9vine",true ="Switzerland/La Brévine",false = location ),location =if_else( location =="Switzerland/Alchenfl\xfch",true ="Switzerland/Alchenflüh",false = location ),location =if_else( location =="Switzerland/Neuch\xe2tel",true ="Switzerland/Neuchâtel",false = location ) )df_full <- df_full_complete |>select(location, country, lat, long, alt, year, bloom_date, bloom_doy) |>distinct_all()# Climate datadf_weather <-read_parquet("../external-data/df_weather.parquet")# Photoperiod datadf_photoperiod <-read_parquet("../external-data/df_photoperiod.parquet")# Data NPNdf_npn <-read_parquet("../external-data/df_npn_usa.parquet")# ------ Coordinates to predict ------------------------------# Coordinates Kyotocoord_kyoto <- df_full |>filter(str_detect(location, "kyoto|Kyoto|KYOTO")) |>distinct(lat, long) |>slice(2) |>mutate(location ="Kyoto (Japan)") |>relocate(location, everything())# Coordinates Liestalcoord_liestal <- df_full |>filter(str_detect(location, "liestal")) |>distinct(lat, long) |>mutate(location ="Liestal-Weideli (Swtizerland)") |>relocate(location, everything())# Coordinates USA WDCcoord_usa_wdc <- df_full |>filter(str_detect(country, "USA-WDC")) |>distinct(lat, long) |>mutate(location ="Washington, D.C. (USA)") |>relocate(location, everything())# Coordinates Vancouvercoord_vancouver <- df_weather |>filter(lat >=49.22& lat <=49.3) |>distinct(lat, long) |>mutate(location ="Vancouver, BC (Canada)") |>relocate(location, everything())# Coordinates New Yorkcoords_ny <- df_npn |>filter(Site_ID ==32789) |>filter(Species_ID ==228) |>distinct(lat, long)coord_usa_ny <- coords_ny |>mutate(location ="New York City, NY (USA)") |>relocate(location, everything())# Dataframe with coordinates to predictdf_coords_predict <-bind_rows(coord_kyoto, coord_liestal, coord_usa_wdc, coord_vancouver, coord_usa_ny) |>rename(location_name = location)# Data frame with coordinates of interestdf_full_coords_predict <- df_full |>filter(lat %in% df_coords_predict$lat) |>left_join(df_coords_predict, by =c("lat", "long"))# Data frame with meteorological information of interest for the coordinates# to be predicteddf_weather_coords_predict <- df_weather |>filter(lat %in% df_full_coords_predict$lat | long %in% df_full_coords_predict$long) |>distinct(lat, long) |>left_join(df_full_coords_predict |>distinct(lat, long, location_name),by =c("lat", "long")) |>slice(-4) |>replace_na(list(location_name ="Liestal-Weideli (Swtizerland)"))# Guarantee only one registration per year in each coordinatedata_bloom_year <- df_full_coords_predict |>group_by(lat, long, year) |>reframe(bloom_doy =max(bloom_doy, na.rm =TRUE),bloom_date_max =max(bloom_date, na.rm =TRUE),altitude =unique(alt),location_name =unique(location_name))# ------ YEARLY CLIMATE ------------------------------# Climate summary by year in coordinates to predictdata_summary_weather1 <- df_weather |>filter(lat %in% df_weather_coords_predict$lat) |>select(-c(MM, DD, DOY, FROST_DAYS)) |>pivot_longer(-c(lat, long, YEAR, YYYYMMDD)) |>group_by(lat, long, YEAR, name) |>reframe(MEAN =mean(value, na.rm =TRUE),MEDIAN =median(value, na.rm =TRUE),STD =sd(value, na.rm =TRUE),MIN =min(value, na.rm =TRUE),MAX =max(value, na.rm =TRUE), ) |>pivot_wider(names_from = name,values_from =-c(lat, long, YEAR, name)) |>rename(year = YEAR)data_summary_weather2 <- df_weather |>filter(lat %in% df_weather_coords_predict$lat) |>select(lat, long, YEAR, YYYYMMDD, FROST_DAYS) |>pivot_longer(-c(lat, long, YEAR, YYYYMMDD)) |>group_by(lat, long, YEAR, name) |>reframe(SUM =sum(value, na.rm =TRUE) ) |>pivot_wider(names_from = name,values_from =-c(lat, long, YEAR, name)) |>rename(year = YEAR) |>rename(TOTAL_FROST_DAYS = FROST_DAYS)data_summary_weather <-left_join(data_summary_weather1, data_summary_weather2,by =c("lat", "long", "year"))# Annual climate data# Join by long data_weather_yearly <-left_join(data_bloom_year |>mutate(lat =as.character(lat),long =as.character(long)), data_summary_weather|>mutate(lat =as.character(lat),long =as.character(long)),by =c("long", "year")) |>filter(year >1980) |>rename(lat = lat.x) |>select(-lat.y) |>relocate(lat, long, everything()) |>mutate(lat =as.numeric(lat),long =as.numeric(long))# ------ MONTHLY CLIMATE ------------------------------# Climate summary by year and month in coordinates to predictdata_summary_weather_monthly1 <- df_weather |>filter(lat %in% df_weather_coords_predict$lat) |>select(-c(DD, DOY, FROST_DAYS)) |>pivot_longer(-c(lat, long, YEAR, MM, YYYYMMDD)) |>group_by(lat, long, YEAR, MM, name) |>reframe(MEAN =mean(value, na.rm =TRUE),MEDIAN =median(value, na.rm =TRUE),STD =sd(value, na.rm =TRUE) ) |>pivot_wider(names_from =c(MM, name),values_from =-c(lat, long, YEAR, MM, name)) |>rename(year = YEAR)data_summary_weather_monthly2 <- df_weather |>filter(lat %in% df_weather_coords_predict$lat) |>select(lat, long, YEAR, MM, YYYYMMDD, FROST_DAYS) |>group_by(lat, long, YEAR, MM) |>reframe(TOTAL_FROST_DAYS =sum(FROST_DAYS, na.rm =TRUE)) |>pivot_wider(names_from =c(MM),values_from =-c(lat, long, YEAR, MM)) |>set_names(c("lat", "long", "year", str_c("TOTAL_FROST_DAYS_M", 1:12)))data_summary_weather_monthly <-left_join(data_summary_weather_monthly1, data_summary_weather_monthly2,by =c("lat", "long", "year"))# Monthly climate data# Join by longdata_weather_monthly <-left_join( data_bloom_year |>mutate(lat =as.character(lat),long =as.character(long)), data_summary_weather_monthly |>mutate(lat =as.character(lat),long =as.character(long)),by =c("long", "year") ) |>filter(year >1980) |>rename(lat = lat.x) |>select(-lat.y) |>relocate(lat, long, everything()) |>mutate(lat =as.numeric(lat),long =as.numeric(long))# ------ PHOTOPERIOD ------------------------------data_photop_coords <- df_photoperiod |>filter(lat %in% df_coords_predict$lat) |>filter(year(date_photoperiod) >1980) |>mutate(year =year(date_photoperiod))data_photop_coords_yearly <- data_photop_coords |>group_by(year, lat) |>reframe(mean_photop =mean(photoperiod, na.rm =TRUE),medianphotop =median(photoperiod, na.rm =TRUE),std_photop =sd(photoperiod, na.rm =TRUE) )data_photop_coords_monthly <- data_photop_coords |>mutate(month_rec =month(date_photoperiod)) |>group_by(year, month_rec, lat) |>reframe(mean_photop =mean(photoperiod, na.rm =TRUE),median_photop =median(photoperiod, na.rm =TRUE),std_photop =sd(photoperiod, na.rm =TRUE) ) |>pivot_wider(names_from = month_rec,values_from =c(mean_photop, median_photop, std_photop) )```# Generalities- I obtain a table that shows the range of dates for which there is a record for each location (coordinate), the number (`n`) of years for which there is a record of flowering is also added.- The record with the minimum flowering date is April 1, 812.- Some locations do not have updated information. The coordinate *Lat: 46.52733 - Long: 8.932936* has 10 records from 1978 to 1987, being the coordinate with the "most outdated" record.```{r}df_full |>group_by(location, lat, long) |>reframe(min_date_doy =min(bloom_date, na.rm =TRUE),max_date_doy =max(bloom_date, na.rm =TRUE),n =n()) |>arrange(min_date_doy) |>mutate(across(where(is.numeric), ~round(.x, digits =4))) |>customReactTable()```# Distribution of days between flowering- To calculate the difference (in days) between flowering $t_i$ with $t_{i-1}$ I take into account coordinates that meet the following characteristics: - Minimum 2 records $(n > 1)$ - Difference between years equal to 1, that is, I only keep coordinates whose blooms have been reported with an annual frequency, it does not matter if it was 20 years ago, the important thing is that there is information year after year.- If we call $\gamma$ the bloom date (`bloom_date`) that has measurements in a year $t$, with $t = 1, 2, ..., k$, where $k$ is the number of records that meet the two restrictions described previously, the distribution shown in the following graphs is the difference of $\gamma_t$ with $\gamma_{t-1}$, that is, the [lag operator with $p = 1$](https://en.wikipedia.org/wiki/Lag_operator) on the difference in days between flowerings. It is important to mention that the user will be able to choose the lag value to generate $(\gamma _{t-p})$, in which case for the first condition $n$ is expressed as $n = p + 1$; The second condition remains the same regardless of the value of $p$.::: panel-tabset## $p = 1\ year$```{r}#| fig-width: 4.5#| fig-height: 3.5#| column: screen#| layout-nrow: 2n_lag_operator <-1res_japan <-lagBloomDate(data = df_full,n_lag = n_lag_operator,country_sel = countries[1],pal_colors = colors_custom )res_kyoto <-lagBloomDate(data = df_full |>filter(str_detect(location, "kyoto|Kyoto|KYOTO")),n_lag = n_lag_operator,country_sel = countries[1],pal_colors = colors_custom )res_switzerland <-lagBloomDate(data = df_full,n_lag = n_lag_operator,country_sel = countries[2],pal_colors = colors_custom )res_southk <-lagBloomDate(data = df_full,n_lag = n_lag_operator,country_sel = countries[3],pal_colors = colors_custom )res_usa_wdc <-lagBloomDate(data = df_full,n_lag = n_lag_operator,country_sel = countries[4],pal_colors = colors_custom )res_usa_ny <-lagBloomDate(data = df_full |>filter(lat %in% coords_ny$lat & long %in% coords_ny$long),n_lag = n_lag_operator,country_sel =NULL,pal_colors = colors_custom )res_japan$plot_diffres_kyoto$plot_diff +labs(subtitle =glue("Kyoto: 2 coordinates with p = {n_lag_operator}"))res_switzerland$plot_diffres_southk$plot_diffres_usa_wdc$plot_diffres_usa_ny$plot_diff```## $p = 5\ years$```{r}#| fig-width: 4.5#| fig-height: 3.5#| column: screen#| layout: [[30,-5, 30, -5, 30], [50, 50]]n_lag_operator <-5res_japan <-lagBloomDate(data = df_full,n_lag = n_lag_operator,country_sel = countries[1],pal_colors = colors_custom )res_kyoto <-lagBloomDate(data = df_full |>filter(str_detect(location, "kyoto|Kyoto|KYOTO")),n_lag = n_lag_operator,country_sel = countries[1],pal_colors = colors_custom )res_switzerland <-lagBloomDate(data = df_full,n_lag = n_lag_operator,country_sel = countries[2],pal_colors = colors_custom )res_southk <-lagBloomDate(data = df_full,n_lag = n_lag_operator,country_sel = countries[3],pal_colors = colors_custom )res_usa_wdc <-lagBloomDate(data = df_full,n_lag = n_lag_operator,country_sel = countries[4],pal_colors = colors_custom )res_japan$plot_diffres_kyoto$plot_diff +labs(subtitle =glue("Kyoto: 2 coordinates with p = {n_lag_operator}"))res_switzerland$plot_diffres_southk$plot_diffres_usa_wdc$plot_diff```:::# DOY lagged- In the following graphs the response variable in year $t$ is related to $t-i$. Unlike the previous graphs in these I use the variable `bloom_doy` and evaluate a decade, that is, $i = 1, 2, ..., 10$.```{r}#| fig-width: 12#| fig-height: 2#| column: screen#| layout-nrow: 5# JapanlagsDOY(data = df_full,country_sel = countries[1],pal_colors = colors_custom)$plot_lags# KyotolagsDOY(data = df_full |>filter(str_detect(location, "kyoto|Kyoto|KYOTO")),country_sel = countries[1],pal_colors = colors_custom)$plot_lags +labs(subtitle ="Kyoto")# SwitzerlandlagsDOY(data = df_full,country_sel = countries[2],pal_colors = colors_custom)$plot_lags# South KorealagsDOY(data = df_full,country_sel = countries[3],pal_colors = colors_custom)$plot_lags# USA-WDClagsDOY(data = df_full,country_sel = countries[4],pal_colors = colors_custom)$plot_lags```# Autocorrelation function (ACF - PACF)- For these charts I get the maximum value of `bloom_doy` per year.```{r}#| fig-width: 4.2#| fig-height: 3#| column: screen#| layout-nrow: 5# JapanplotACFMaxDOY(data = df_full, country_sel = countries[1])$plot1plotACFMaxDOY(data = df_full, country_sel = countries[1])$plot2plotACFMaxDOY(data = df_full, country_sel = countries[1])$plot3# KyotoplotACFMaxDOY(data = df_full |>filter(str_detect(location, "kyoto|Kyoto|KYOTO")),country_sel = countries[1])$plot1 +labs(subtitle ="Kyoto (all data)")plotACFMaxDOY(data = df_full |>filter(str_detect(location, "kyoto|Kyoto|KYOTO")),country_sel = countries[1])$plot2 +labs(subtitle ="Kyoto (all data)")plotACFMaxDOY(data = df_full |>filter(str_detect(location, "kyoto|Kyoto|KYOTO")),country_sel = countries[1])$plot3 +labs(subtitle ="Kyoto (50 years)")# SwitzerlandplotACFMaxDOY(data = df_full, country_sel = countries[2])$plot1plotACFMaxDOY(data = df_full, country_sel = countries[2])$plot2plotACFMaxDOY(data = df_full, country_sel = countries[2])$plot3# South KoreaplotACFMaxDOY(data = df_full, country_sel = countries[3])$plot1plotACFMaxDOY(data = df_full, country_sel = countries[3])$plot2plotACFMaxDOY(data = df_full, country_sel = countries[3])$plot3# USA-WDCplotACFMaxDOY(data = df_full, country_sel = countries[4])$plot1plotACFMaxDOY(data = df_full, country_sel = countries[4])$plot2plotACFMaxDOY(data = df_full, country_sel = countries[4])$plot3```# Doy time series- Given that the historical record for the coordinates is different and that very few have old records, I decide to use data from the year 1950.- In the graph of Japan there are some coordinates with `bloom_doy` values less than 60, these 10 places are listed below: - *Japan/Naze* - *Japan/Naze/Funchatoge* - *Japan/Yonagunijima* - *Japan/Iriomotejima* - *Japan/Ishigakijima* - *Japan/Miyakojima* - *Japan/Kumejima* - *Japan/Naha* - *Japan/Nago* - *Japan/Minamidaitojima*- The red line of the mita graph is the average trend obtained through a GAM model with five degrees of freedom.::: panel-tabset## `r countries[1]````{r}#| fig-width: 4.5#| fig-height: 3.5#| column: screen#| layout-nrow: 1timeSerieDOY(data = df_full,country_sel = countries[1],pal_colors = colors_custom)$plot1timeSerieDOY(data = df_full |>filter(bloom_doy >60),country_sel = countries[1],pal_colors = colors_custom)$plot2 +labs(subtitle ="Japan with DOY > 60")timeSerieDOY(data = df_full |>filter(bloom_doy >60),country_sel = countries[1],pal_colors = colors_custom)$plot3 +labs(subtitle ="Japan with DOY > 60")```## `r countries[1]` - Kyoto```{r}#| fig-width: 4.5#| fig-height: 3.5#| column: screen#| layout-nrow: 1timeSerieDOY(data = df_full |>filter(str_detect(location, "kyoto|Kyoto|KYOTO")),country_sel = countries[1],pal_colors = colors_custom)$plot1 +labs(subtitle ="Kyoto")timeSerieDOY(data = df_full |>filter(str_detect(location, "kyoto|Kyoto|KYOTO")),country_sel = countries[1],pal_colors = colors_custom)$plot2 +labs(subtitle ="Kyoto")timeSerieDOY(data = df_full |>filter(str_detect(location, "kyoto|Kyoto|KYOTO")),country_sel = countries[1],pal_colors = colors_custom)$plot3 +labs(subtitle ="Kyoto")```## `r countries[2]````{r}#| fig-width: 4.5#| fig-height: 3.5#| column: screen#| layout-nrow: 1timeSerieDOY(data = df_full,country_sel = countries[2],pal_colors = colors_custom)$plot1timeSerieDOY(data = df_full,country_sel = countries[2],pal_colors = colors_custom)$plot2timeSerieDOY(data = df_full,country_sel = countries[2],pal_colors = colors_custom)$plot3```## `r countries[3]````{r}#| fig-width: 4.5#| fig-height: 3.5#| column: screen#| layout-nrow: 1timeSerieDOY(data = df_full,country_sel = countries[3],pal_colors = colors_custom)$plot1timeSerieDOY(data = df_full,country_sel = countries[3],pal_colors = colors_custom)$plot2timeSerieDOY(data = df_full,country_sel = countries[3],pal_colors = colors_custom)$plot3```## `r countries[4]````{r}#| fig-width: 4.5#| fig-height: 3.5#| column: screen#| layout-nrow: 1timeSerieDOY(data = df_full,country_sel = countries[4],pal_colors = colors_custom)$plot1timeSerieDOY(data = df_full,country_sel = countries[4],pal_colors = colors_custom)$plot2timeSerieDOY(data = df_full,country_sel = countries[4],pal_colors = colors_custom)$plot3```## USA-NY```{r}#| fig-width: 4.5#| fig-height: 3.5#| column: screen#| layout-nrow: 1timeSerieDOY(data = df_full |>filter(lat %in% coords_ny$lat & long %in% coords_ny$long),country_sel =NULL,pal_colors = colors_custom)$plot1timeSerieDOY(data = df_full |>filter(lat %in% coords_ny$lat & long %in% coords_ny$long),country_sel =NULL,pal_colors = colors_custom)$plot2timeSerieDOY(data = df_full |>filter(lat %in% coords_ny$lat & long %in% coords_ny$long),country_sel =NULL,pal_colors = colors_custom)$plot3```:::# Time series: coordinates of interest- I extract only the coordinates of interest to the competition.::: panel-tabset## Static chart```{r}#| fig-width: 6.2#| fig-height: 4#| column: screen#| layout-nrow: 1g1 <- df_full_coords_predict |>ggplot(aes(x = year, y = bloom_doy)) +facet_wrap( ~ location_name, scales ="free") +geom_line(color = colors_custom[1],alpha =0.65) +geom_smooth(method ="gam",formula = y ~ns(x, df =5),color = colors_custom[2],se =TRUE,size =0.7 ) +labs(x ="Year",y ="DOY",title ="DOY time series",subtitle ="Original variable - Historical" )g2 <- df_full_coords_predict |>filter(year >1950) |>group_by(location_name, lat, long) |>mutate(bloom_doy_stand =scale(bloom_doy, scale =TRUE, center =TRUE)) |>ungroup() |>ggplot(aes(x = year, y = bloom_doy_stand, color = location_name)) +geom_point(size =0.65) +geom_line(size =0.5, alpha =0.5) +geom_hline(yintercept =0,lty =2,color ="gray15",size =0.2 ) +geom_smooth(method ="gam",formula = y ~ns(x, df =5),color ="gray15",se =TRUE,size =0.7 ) +labs(x ="Year",y = latex2exp::TeX(r'($DOY\ \left(\frac{x_i - \mu}{\sigma}\right)$)'),title ="DOY time series",subtitle ="Standardized variable - since 1950",color ="" ) +scale_color_manual(values = colors_custom) +guides(color =guide_legend(nrow =2))g1g2```## Interactive chart```{r}#| fig-width: 6.2#| fig-height: 4#| column: screen#| layout-nrow: 1ggplotly(g2 +labs(y ="Standardized DOY", title ="Standardized variable - since 1950")) |>layout(legend =list(orientation ="h",xanchor ="center",y =-0.2,x =0.5,traceorder ="normal",font =list(size =12),bgcolor ="white",bordercolor ="#444",borderwidth =1 ) )```:::# Yearly weather vs DOY- The meteorological data were obtained with R from the [Prediction of Worldwide Energy Resources (POWER)](https://power.larc.nasa.gov/#resources) project, using the [`nasapower`](https://docs.ropensci.org/nasapower/) for extraction of time series (from 1981-01-01 to 2024-02-25) with daily frequency of the following variables: - Average Earth Surface Temperature (TS) - Maximum temperature of the earth's surface (TS_MAX) - Minimum land surface temperature (TS_MIN) - Evaporation from the earth's surface (EVLAND) - Frost days (FROST_DAYS) - Precipitation (PRECTOTCORR) - Soil moisture profile (GWETPROF) - Relative humidity at two meters (RH2M) - Soil moisture in the root zone (GWETROOT) - The total amount of ozone in a column extending vertically from the Earth's surface to the top of the atmosphere (TO3)- To see the description of each parameter, see [POWER resources.](https://power.larc.nasa.gov/#resources)- As the POWER project is aimed at three user [communities](https://power.larc.nasa.gov/docs/methodology/communities/), in this case the Agroclimatology community (**AG**) is used.- For these graphs I filter out records that are not from New York since the DOY is very high compared to the other four sites.- It is important to mention that these graphs are built with the data summarized by year (`data_weather_yearly`), however, later the relationship of the climate summarized on a monthly basis (`data_weather_monthly`) is evaluated.## General```{r}#| fig-width: 7#| fig-height: 19#| column: screen#| layout-nrow: 1scatterYearlyClimate(data = data_weather_yearly,list_coords = df_coords_predict$lat,n_lag =1,loc_name =NULL,pal_colors = colors_custom)$plot1scatterYearlyClimate(data = data_weather_yearly,list_coords = df_coords_predict$lat,n_lag =1,loc_name =NULL,pal_colors = colors_custom)$plot2```## By location with p=1- Charts are shown only for locations that have sufficient information.::: panel-tabset### `r df_coords_predict$location_name[1]````{r}#| fig-width: 9#| fig-height: 19#| column: page-inset-rightscatterYearlyClimate(data = data_weather_yearly,list_coords = df_coords_predict$lat,n_lag =1,loc_name = df_coords_predict$location_name[1],pal_colors = colors_custom)```### `r df_coords_predict$location_name[2]````{r}#| fig-width: 9#| fig-height: 19#| column: page-inset-rightscatterYearlyClimate(data = data_weather_yearly,list_coords = df_coords_predict$lat,n_lag =1,loc_name = df_coords_predict$location_name[2],pal_colors = colors_custom)```### `r df_coords_predict$location_name[3]````{r}#| fig-width: 9#| fig-height: 19#| column: page-inset-rightscatterYearlyClimate(data = data_weather_yearly,list_coords = df_coords_predict$lat,n_lag =5,loc_name = df_coords_predict$location_name[3],pal_colors = colors_custom)```:::## By location with p=5::: panel-tabset### `r df_coords_predict$location_name[1]````{r}#| fig-width: 9#| fig-height: 19#| column: page-inset-rightscatterYearlyClimate(data = data_weather_yearly,list_coords = df_coords_predict$lat,n_lag =5,loc_name = df_coords_predict$location_name[1],pal_colors = colors_custom)```### `r df_coords_predict$location_name[2]````{r}#| fig-width: 9#| fig-height: 19#| column: page-inset-rightscatterYearlyClimate(data = data_weather_yearly,list_coords = df_coords_predict$lat,n_lag =5,loc_name = df_coords_predict$location_name[2],pal_colors = colors_custom)```### `r df_coords_predict$location_name[3]````{r}#| fig-width: 9#| fig-height: 19#| column: page-inset-rightscatterYearlyClimate(data = data_weather_yearly,list_coords = df_coords_predict$lat,n_lag =5,loc_name = df_coords_predict$location_name[3],pal_colors = colors_custom)```:::## By location with p=10::: panel-tabset### `r df_coords_predict$location_name[1]````{r}#| fig-width: 9#| fig-height: 19#| column: page-inset-rightscatterYearlyClimate(data = data_weather_yearly,list_coords = df_coords_predict$lat,n_lag =10,loc_name = df_coords_predict$location_name[1],pal_colors = colors_custom)```### `r df_coords_predict$location_name[2]````{r}#| fig-width: 9#| fig-height: 19#| column: page-inset-rightscatterYearlyClimate(data = data_weather_yearly,list_coords = df_coords_predict$lat,n_lag =10,loc_name = df_coords_predict$location_name[2],pal_colors = colors_custom)```### `r df_coords_predict$location_name[3]````{r}#| fig-width: 9#| fig-height: 19#| column: page-inset-rightscatterYearlyClimate(data = data_weather_yearly,list_coords = df_coords_predict$lat,n_lag =10,loc_name = df_coords_predict$location_name[3],pal_colors = colors_custom)```:::# Monthly weather vs DOY- These graphs show the correlation profile with the 50 variables with the greatest association (positive or negative).```{r}#| fig-width: 4.5#| fig-height: 10#| column: screen#| layout-nrow: 1corrMonthlyClimate(data = data_weather_monthly,list_coords = df_coords_predict$lat,n_lag =1,loc_name = df_coords_predict$location_name[1],pal_colors = colors_custom)$plot1corrMonthlyClimate(data = data_weather_monthly,list_coords = df_coords_predict$lat,n_lag =1,loc_name = df_coords_predict$location_name[2],pal_colors = colors_custom)$plot1corrMonthlyClimate(data = data_weather_monthly,list_coords = df_coords_predict$lat,n_lag =1,loc_name = df_coords_predict$location_name[3],pal_colors = colors_custom)$plot1corrMonthlyClimate(data = data_weather_monthly,list_coords = df_coords_predict$lat,n_lag =1,loc_name =NULL,pal_colors = colors_custom)$plot1```